1. Introduction

“The German Credit data has data on 1000 past credit applicants, described by 30 variables. Each applicant is rated as ‘Good’ or ‘Bad’ credit (encoded as 1 and 0 respectively in the response variable). We want to obtain a model that may be used to determine if new applicants present a good or bad credit risk.”

In this report, we use both a CART model and clustering for the analysis.

2. Exploratory data analysis

2.1 Data structure and summary

Q: Should all non-numeric variables be converted to factors?

#> 'data.frame':    1000 obs. of  32 variables:
#>  $ OBS.            : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ CHK_ACCT        : Factor w/ 4 levels "0","1","2","3": 1 2 4 1 1 ..
#>  $ DURATION        : int  6 48 12 42 24 36 24 36 12 30 ...
#>  $ HISTORY         : Factor w/ 5 levels "0","1","2","3",..: 5 3 5 3..
#>  $ NEW_CAR         : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 ..
#>  $ USED_CAR        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 ..
#>  $ FURNITURE       : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 ..
#>  $ RADIO.TV        : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 2 ..
#>  $ EDUCATION       : Factor w/ 3 levels "-1","0","1": 2 2 3 2 2 3 2..
#>  $ RETRAINING      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 ..
#>  $ AMOUNT          : int  1169 5951 2096 7882 4870 9055 2835 6948 3..
#>  $ SAV_ACCT        : Factor w/ 5 levels "0","1","2","3",..: 5 1 1 1..
#>  $ EMPLOYMENT      : Factor w/ 5 levels "0","1","2","3",..: 5 3 4 4..
#>  $ INSTALL_RATE    : int  4 2 2 2 3 2 3 2 2 4 ...
#>  $ MALE_DIV        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 ..
#>  $ MALE_SINGLE     : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 1 ..
#>  $ MALE_MAR_or_WID : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 ..
#>  $ CO.APPLICANT    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 ..
#>  $ GUARANTOR       : Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 ..
#>  $ PRESENT_RESIDENT: Factor w/ 4 levels "1","2","3","4": 4 2 3 4 4 ..
#>  $ REAL_ESTATE     : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 2 ..
#>  $ PROP_UNKN_NONE  : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 1 1 ..
#>  $ AGE             : int  67 22 49 45 53 35 53 35 61 28 ...
#>  $ OTHER_INSTALL   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 ..
#>  $ RENT            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 ..
#>  $ OWN_RES         : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 ..
#>  $ NUM_CREDITS     : int  2 1 1 1 2 1 1 1 1 2 ...
#>  $ JOB             : Factor w/ 4 levels "0","1","2","3": 3 3 2 3 3 ..
#>  $ NUM_DEPENDENTS  : int  1 1 2 2 2 2 1 1 1 1 ...
#>  $ TELEPHONE       : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 2 1 ..
#>  $ FOREIGN         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 ..
#>  $ RESPONSE        : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 ..
#>       OBS.      CHK_ACCT    DURATION    HISTORY NEW_CAR USED_CAR
#>  Min.   :   1   0:274    Min.   : 4.0   0: 40   0:766   0:897   
#>  1st Qu.: 251   1:269    1st Qu.:12.0   1: 49   1:234   1:103   
#>  Median : 500   2: 63    Median :18.0   2:530                   
#>  Mean   : 500   3:394    Mean   :20.9   3: 88                   
#>  3rd Qu.: 750            3rd Qu.:24.0   4:293                   
#>  Max.   :1000            Max.   :72.0                           
#>  FURNITURE RADIO.TV EDUCATION RETRAINING     AMOUNT      SAV_ACCT
#>  0:819     0:720    -1:  1    0:903      Min.   :  250   0:603   
#>  1:181     1:280    0 :950    1: 97      1st Qu.: 1366   1:103   
#>                     1 : 49               Median : 2320   2: 63   
#>                                          Mean   : 3271   3: 48   
#>                                          3rd Qu.: 3972   4:183   
#>                                          Max.   :18424           
#>  EMPLOYMENT  INSTALL_RATE  MALE_DIV MALE_SINGLE MALE_MAR_or_WID
#>  0: 62      Min.   :1.00   0:950    0:452       0:908          
#>  1:172      1st Qu.:2.00   1: 50    1:548       1: 92          
#>  2:339      Median :3.00                                       
#>  3:174      Mean   :2.97                                       
#>  4:253      3rd Qu.:4.00                                       
#>             Max.   :4.00                                       
#>  CO.APPLICANT GUARANTOR PRESENT_RESIDENT REAL_ESTATE PROP_UNKN_NONE
#>  0:959        0:948     1:130            0:718       0:846         
#>  1: 41        1: 51     2:308            1:282       1:154         
#>               2:  1     3:149                                      
#>                         4:413                                      
#>                                                                    
#>                                                                    
#>       AGE        OTHER_INSTALL RENT    OWN_RES  NUM_CREDITS  
#>  Min.   : 19.0   0:814         0:821   0:287   Min.   :1.00  
#>  1st Qu.: 27.0   1:186         1:179   1:713   1st Qu.:1.00  
#>  Median : 33.0                                 Median :1.00  
#>  Mean   : 35.6                                 Mean   :1.41  
#>  3rd Qu.: 42.0                                 3rd Qu.:2.00  
#>  Max.   :125.0                                 Max.   :4.00  
#>  JOB     NUM_DEPENDENTS TELEPHONE FOREIGN RESPONSE
#>  0: 22   Min.   :1.00   0:596     0:963   0:300   
#>  1:200   1st Qu.:1.00   1:404     1: 37   1:700   
#>  2:630   Median :1.00                             
#>  3:148   Mean   :1.16                             
#>          3rd Qu.:1.00                             
#>          Max.   :2.00
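
The structure output above shows the coded categorical variables stored as factors and the measured quantities stored as integers. A conversion along these lines could be sketched as follows. This is only an illustration: the three-row data frame `credit` and the choice of `numeric_vars` are assumptions made for the example, not the report's actual loading code.

```r
# Hypothetical three-row excerpt; the values are made up for illustration only.
credit <- data.frame(
  CHK_ACCT = c(0L, 1L, 3L),
  DURATION = c(6L, 48L, 12L),
  AMOUNT   = c(1169L, 5951L, 2096L),
  NEW_CAR  = c(0L, 0L, 1L),
  RESPONSE = c(1L, 0L, 1L)
)

# Columns kept as numbers (assumption: the columns reported as int above).
numeric_vars <- c("DURATION", "AMOUNT")

# Every other column is a coded category, so it is converted to a factor.
categorical_vars <- setdiff(names(credit), numeric_vars)
credit[categorical_vars] <- lapply(credit[categorical_vars], factor)

str(credit)
```

Converting by exclusion (everything except a short list of numeric columns) keeps the code short when most columns are categorical, as they are here.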

Data Frame Summary

Dimensions: 1000 x 32
Duplicates: 0
No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing
1 | OBS. [integer] | Mean (sd): 500 (289); min ≤ med ≤ max: 1 ≤ 500 ≤ 1000; IQR (CV): 500 (0.6) | 1000 distinct values (integer sequence) | 1000 (100.0%) | 0 (0.0%)
2 | CHK_ACCT [factor] | levels: 0, 1, 2, 3 | 274 (27.4%), 269 (26.9%), 63 (6.3%), 394 (39.4%) | 1000 (100.0%) | 0 (0.0%)
3 | DURATION [integer] | Mean (sd): 20.9 (12.1); min ≤ med ≤ max: 4 ≤ 18 ≤ 72; IQR (CV): 12 (0.6) | 33 distinct values | 1000 (100.0%) | 0 (0.0%)
4 | HISTORY [factor] | levels: 0, 1, 2, 3, 4 | 40 (4.0%), 49 (4.9%), 530 (53.0%), 88 (8.8%), 293 (29.3%) | 1000 (100.0%) | 0 (0.0%)
5 | NEW_CAR [factor] | levels: 0, 1 | 766 (76.6%), 234 (23.4%) | 1000 (100.0%) | 0 (0.0%)
6 | USED_CAR [factor] | levels: 0, 1 | 897 (89.7%), 103 (10.3%) | 1000 (100.0%) | 0 (0.0%)
7 | FURNITURE [factor] | levels: 0, 1 | 819 (81.9%), 181 (18.1%) | 1000 (100.0%) | 0 (0.0%)
8 | RADIO.TV [factor] | levels: 0, 1 | 720 (72.0%), 280 (28.0%) | 1000 (100.0%) | 0 (0.0%)
9 | EDUCATION [factor] | levels: -1, 0, 1 | 1 (0.1%), 950 (95.0%), 49 (4.9%) | 1000 (100.0%) | 0 (0.0%)
10 | RETRAINING [factor] | levels: 0, 1 | 903 (90.3%), 97 (9.7%) | 1000 (100.0%) | 0 (0.0%)
11 | AMOUNT [integer] | Mean (sd): 3271 (2823); min ≤ med ≤ max: 250 ≤ 2320 ≤ 18424; IQR (CV): 2607 (0.9) | 921 distinct values | 1000 (100.0%) | 0 (0.0%)
12 | SAV_ACCT [factor] | levels: 0, 1, 2, 3, 4 | 603 (60.3%), 103 (10.3%), 63 (6.3%), 48 (4.8%), 183 (18.3%) | 1000 (100.0%) | 0 (0.0%)
13 | EMPLOYMENT [factor] | levels: 0, 1, 2, 3, 4 | 62 (6.2%), 172 (17.2%), 339 (33.9%), 174 (17.4%), 253 (25.3%) | 1000 (100.0%) | 0 (0.0%)
14 | INSTALL_RATE [integer] | Mean (sd): 3 (1.1); min ≤ med ≤ max: 1 ≤ 3 ≤ 4; IQR (CV): 2 (0.4) | 1: 136 (13.6%), 2: 231 (23.1%), 3: 157 (15.7%), 4: 476 (47.6%) | 1000 (100.0%) | 0 (0.0%)
15 | MALE_DIV [factor] | levels: 0, 1 | 950 (95.0%), 50 (5.0%) | 1000 (100.0%) | 0 (0.0%)
16 | MALE_SINGLE [factor] | levels: 0, 1 | 452 (45.2%), 548 (54.8%) | 1000 (100.0%) | 0 (0.0%)
17 | MALE_MAR_or_WID [factor] | levels: 0, 1 | 908 (90.8%), 92 (9.2%) | 1000 (100.0%) | 0 (0.0%)
18 | CO.APPLICANT [factor] | levels: 0, 1 | 959 (95.9%), 41 (4.1%) | 1000 (100.0%) | 0 (0.0%)
19 | GUARANTOR [factor] | levels: 0, 1, 2 | 948 (94.8%), 51 (5.1%), 1 (0.1%) | 1000 (100.0%) | 0 (0.0%)
20 | PRESENT_RESIDENT [factor] | levels: 1, 2, 3, 4 | 130 (13.0%), 308 (30.8%), 149 (14.9%), 413 (41.3%) | 1000 (100.0%) | 0 (0.0%)
21 | REAL_ESTATE [factor] | levels: 0, 1 | 718 (71.8%), 282 (28.2%) | 1000 (100.0%) | 0 (0.0%)
22 | PROP_UNKN_NONE [factor] | levels: 0, 1 | 846 (84.6%), 154 (15.4%) | 1000 (100.0%) | 0 (0.0%)
23 | AGE [integer] | Mean (sd): 35.6 (11.7); min ≤ med ≤ max: 19 ≤ 33 ≤ 125; IQR (CV): 15 (0.3) | 54 distinct values | 1000 (100.0%) | 0 (0.0%)
24 | OTHER_INSTALL [factor] | levels: 0, 1 | 814 (81.4%), 186 (18.6%) | 1000 (100.0%) | 0 (0.0%)
25 | RENT [factor] | levels: 0, 1 | 821 (82.1%), 179 (17.9%) | 1000 (100.0%) | 0 (0.0%)
26 | OWN_RES [factor] | levels: 0, 1 | 287 (28.7%), 713 (71.3%) | 1000 (100.0%) | 0 (0.0%)
27 | NUM_CREDITS [integer] | Mean (sd): 1.4 (0.6); min ≤ med ≤ max: 1 ≤ 1 ≤ 4; IQR (CV): 1 (0.4) | 1: 633 (63.3%), 2: 333 (33.3%), 3: 28 (2.8%), 4: 6 (0.6%) | 1000 (100.0%) | 0 (0.0%)
28 | JOB [factor] | levels: 0, 1, 2, 3 | 22 (2.2%), 200 (20.0%), 630 (63.0%), 148 (14.8%) | 1000 (100.0%) | 0 (0.0%)
29 | NUM_DEPENDENTS [integer] | Min: 1; Mean: 1.2; Max: 2 | 1: 845 (84.5%), 2: 155 (15.5%) | 1000 (100.0%) | 0 (0.0%)
30 | TELEPHONE [factor] | levels: 0, 1 | 596 (59.6%), 404 (40.4%) | 1000 (100.0%) | 0 (0.0%)
31 | FOREIGN [factor] | levels: 0, 1 | 963 (96.3%), 37 (3.7%) | 1000 (100.0%) | 0 (0.0%)
32 | RESPONSE [factor] | levels: 0, 1 | 300 (30.0%), 700 (70.0%) | 1000 (100.0%) | 0 (0.0%)

Generated by summarytools 1.0.0 (R version 4.1.2)
2022-04-28

2.2 Graphs for variables

2.2.1 Graph for response

2.2.2 Graphs for numeric variables

2.2.3 Graphs for the purpose-of-credit variables

2.2.4 Graphs for other categorical variables

2.2.5 Observation profiling

In this part, we profile the observations separately for each response class. The resulting profile shows how applicants with good credit compare with those with bad credit on each variable.
We initially wanted to use boxplots as above, but they are not suitable for binary variables: a 0/1 variable takes only two values, so its quartiles collapse onto 0 or 1 and the box carries almost no information.
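
One alternative to boxplots for binary variables is to compare, per response class, the share of applicants for whom each indicator equals 1. The sketch below illustrates the idea in base R. The data frame `credit` is a small synthetic stand-in with invented values, and the choice of `NEW_CAR` and `RENT` is arbitrary; it is not the profiling code used for this report.

```r
# Synthetic stand-in for the credit data; the values are invented.
credit <- data.frame(
  RESPONSE = factor(c(0, 0, 1, 1, 1, 1)),
  NEW_CAR  = c(1, 0, 0, 0, 1, 0),
  RENT     = c(1, 1, 0, 0, 0, 1)
)

# For a 0/1 indicator, the class mean is the share of applicants with value 1.
binary_vars <- c("NEW_CAR", "RENT")
profile <- aggregate(credit[binary_vars],
                     by  = list(RESPONSE = credit$RESPONSE),
                     FUN = mean)

print(profile)
```

Each row of `profile` is one response class, so the two rows can be read side by side or drawn as grouped bars.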

2.2.6 Correlation plot

3. Data preparation

3.1 Balancing data

#> 
#>   0   1 
#> 240 560
#> 
#>   0   1 
#> 560 560
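
The two tables above are consistent with the minority class (0) being upsampled from 240 to 560 training observations. The balancing code itself is not shown, so the following is only a minimal sketch of random upsampling with replacement in base R. The simulated data frame `train`, the `AMOUNT` values, and the seed are assumptions made for the example.

```r
set.seed(123)  # arbitrary seed, only to make the sketch reproducible

# Simulated training set with the class counts shown above (240 bad, 560 good).
train <- data.frame(
  RESPONSE = factor(rep(c("0", "1"), times = c(240, 560))),
  AMOUNT   = round(runif(800, min = 250, max = 18424))
)

counts   <- table(train$RESPONSE)
minority <- names(counts)[which.min(counts)]
n_extra  <- max(counts) - min(counts)

# Draw extra minority rows with replacement until both classes are equal in size.
minority_rows <- which(train$RESPONSE == minority)
extra_rows    <- sample(minority_rows, size = n_extra, replace = TRUE)
train_bal     <- train[c(seq_len(nrow(train)), extra_rows), ]

table(train_bal$RESPONSE)
```

Only the training set is resampled in this sketch; the test set keeps its original class proportions.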

3.2 Cross validation

#> CART 
#> 
#> 1120 samples
#>   31 predictor
#>    2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 1008, 1008, 1008, 1008, 1008, 1008, ... 
#> Resampling results across tuning parameters:
#> 
#>   cp      Accuracy  Kappa 
#>   0.0223  0.653     0.3054
#>   0.0357  0.641     0.2821
#>   0.3071  0.546     0.0911
#> 
#> Accuracy was used to select the optimal model using the
#>  largest value.
#> The final value used for the model was cp = 0.0223.
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 42 47
#>          1 18 93
#>                                         
#>                Accuracy : 0.675         
#>                  95% CI : (0.605, 0.739)
#>     No Information Rate : 0.7           
#>     P-Value [Acc > NIR] : 0.802764      
#>                                         
#>                   Kappa : 0.32          
#>                                         
#>  Mcnemar's Test P-Value : 0.000515      
#>                                         
#>             Sensitivity : 0.700         
#>             Specificity : 0.664         
#>          Pos Pred Value : 0.472         
#>          Neg Pred Value : 0.838         
#>              Prevalence : 0.300         
#>          Detection Rate : 0.210         
#>    Detection Prevalence : 0.445         
#>       Balanced Accuracy : 0.682         
#>                                         
#>        'Positive' Class : 0             
#> 
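
To make the statistics above easier to check, the following sketch recomputes the main ones directly from the four cells of the confusion matrix, with class 0 as the positive class as in the output. The variable names are chosen for the example only.

```r
# Test-set confusion matrix copied from the output above
# (rows = prediction, columns = reference; positive class = "0").
cm <- matrix(c(42, 18, 47, 93), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n  <- sum(cm)
tp <- cm["0", "0"]  # predicted 0, actually 0
fn <- cm["1", "0"]  # predicted 1, actually 0
fp <- cm["0", "1"]  # predicted 0, actually 1
tn <- cm["1", "1"]  # predicted 1, actually 1

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement.
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced,
        kappa = kappa_hat), 3)
```

The recomputed accuracy of 0.675 is below the no-information rate of 0.7 reported above, which is why the test of accuracy against that rate is not significant.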

4. Modelling

4.1 CART

#>     obs
#> Pred  0  1
#>    0 46 41
#>    1 14 99
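
A table of this shape can be produced by fitting a classification tree and cross-tabulating its predictions against the observed classes. The sketch below uses the `rpart` package on simulated data. The data frame `sim`, the rule that generates its response, the 80/20 split, and the seed are all assumptions made for the example. The value `cp = 0.0223` is the one selected by cross-validation in section 3.2, but whether the tree reported here was fitted with exactly these settings is not shown.

```r
library(rpart)  # CART implementation (recommended package shipped with R)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Simulated data: three predictors named after columns of the credit data.
n <- 400
sim <- data.frame(
  DURATION = sample(4:72, n, replace = TRUE),
  AMOUNT   = round(runif(n, min = 250, max = 18424)),
  CHK_ACCT = factor(sample(0:3, n, replace = TRUE))
)

# Invented rule so that the simulated response depends on the predictors.
p_good <- plogis(1.5 - 0.04 * sim$DURATION + 0.5 * (sim$CHK_ACCT == "3"))
sim$RESPONSE <- factor(rbinom(n, size = 1, prob = p_good), levels = c(0, 1))

# 80/20 split into training and test sets.
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit a classification tree and predict the class of each test observation.
fit  <- rpart(RESPONSE ~ ., data = train, method = "class",
              control = rpart.control(cp = 0.0223))
pred <- predict(fit, newdata = test, type = "class")

table(Pred = pred, obs = test$RESPONSE)
```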

4.2 Clustering

5. Conclusion